Your task is to analyze game and player statistics from the NBA 2008-2009 season provided as a relational database of play-by-play data.
You can use jupyter to execute the queries. To write and test the queries, use the web interface at http://pollux.vo.elte.hu/basketball/. Copy queries into the worksheet for submission.
You can compare your results with https://en.wikipedia.org/wiki/2008%E2%80%9309_NBA_season
! pip install --user pymssql
import pymssql
import pandas as pd
import holoviews as hv
from holoviews import opts
hv.extension('bokeh')
# This opens a connection to the MS SQL server
conn = pymssql.connect(server='kooplex-temp.elte.hu',
port=1433,
user='basketball_user',
password='sayhaTU56#@dfsS',
database='Basketball')
# This closes it
# conn.close()
# We need a cursor that contains the state of our query(s)
cursor = conn.cursor()
# A test query
res = cursor.execute('select * from shot;')
print(cursor.fetchone())
a) List all teams playing in the league during the season
b) List teams in alphabetic order but division by division
c) Count the number of persons/players in the league
# queries
tables = pd.read_sql('select * from INFORMATION_SCHEMA.tables', conn)
columns = pd.read_sql('select * from INFORMATION_SCHEMA.columns', conn)
cols = columns[['TABLE_NAME','COLUMN_NAME', 'DATA_TYPE']]
# Print tables with holoviews, so it is easier to explore each table's structure
key_dimensions = ['TABLE_NAME']
value_dimensions = ['COLUMN_NAME', 'DATA_TYPE']
hv_data = hv.Table(cols, key_dimensions, value_dimensions)
#t1 = hv_data.to.table(['TABLE_NAME'], [] )
table_names = hv.Div(data="<br>".join(cols['TABLE_NAME'].unique()))
t2 = hv_data.to.table(['COLUMN_NAME', 'DATA_TYPE'])
tt = table_names + t2
tt
###### a)
team = pd.read_sql("SELECT * FROM team;",conn)
team.head(10)
###### b)
team_alphabetic = pd.read_sql("""SELECT *
FROM team
ORDER BY division_id, name;""",conn)
team_alphabetic.head(10)
###### c)
players = pd.read_sql("""SELECT COUNT(person_id)
FROM player;""",conn)
persons = pd.read_sql("""SELECT COUNT(*)
FROM person;""", conn)
persons, players #it is not the same
a) List teams by division, including the division name
b) List players by team including the uniform number
tt
###### a)
team_by_div = pd.read_sql("""
SELECT team.name, division.name
FROM division
LEFT JOIN team
ON division.id=team.division_id
ORDER BY division.name;
""", conn)
team_by_div.head(10)
###### b)
pd.read_sql("""
SELECT person.name, player.number, team.name
FROM person
LEFT JOIN player
ON player.person_id=person.id
LEFT JOIN team
ON team.id=player.team_id
ORDER BY team.name, person.name;""",conn)
a) Count number of persons playing for a given team
b) List all teams along with the number of players
c) Count the number of teams a given person played for
d) List all players along with the number of teams they played for
e) List players who played for more than two teams during the season
###### a) Let this given team be Atlanta Hawks
pd.read_sql("""
SELECT COUNT( DISTINCT person_id ) num_of_players
FROM player
WHERE team_id = 'ATL'
""",conn)
###### b)
b = pd.read_sql("""
SELECT COUNT(id) num_of_players, team_id
FROM player
GROUP BY team_id
""",conn)
print(b.num_of_players.sum())
b
###### c)
pd.read_sql("""
SELECT person_id, COUNT(team_id)
FROM player
GROUP BY person_id
""", conn)
###### d)
pd.read_sql("""
SELECT player.person_id, person.name, COUNT(team_id)
FROM player
INNER JOIN person
ON player.person_id = person.id
GROUP BY player.person_id, person.name
""", conn)
###### e)
pd.read_sql("""
SELECT player.person_id, person.name, COUNT(team_id)
FROM player
INNER JOIN person
ON player.person_id = person.id
GROUP BY player.person_id, person.name
HAVING COUNT(team_id) > 2
""", conn)
a) Pick a game and list all field shots and free throw attempts, including points
b) Combine the previous two into a single table and only show successful attempts
c) Take previous query and sum all points by team so that end result is available
d) Remove filter on game and get end result of each game during season
e) Modify the previous query to return a single line per game, with home team and away team points
###### a) All shots and attempts
pd.read_sql("""
SELECT * FROM shot
LEFT JOIN point
ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
WHERE shot.game_id = 1
""",conn)
##### a) All freethrows and attempts
pd.read_sql("""
SELECT * FROM freethrow
LEFT JOIN point
ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)
WHERE freethrow.game_id = 1
""",conn)
###### b)
pd.read_sql("""
SELECT * FROM point
LEFT JOIN shot
ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
LEFT JOIN freethrow
ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)
WHERE point.game_id = 1
""",conn)
###### c)
pd.read_sql("""
SELECT SUM(point.points), CONCAT(shot.team_id, freethrow.team_id) FROM point
LEFT JOIN shot
ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
LEFT JOIN freethrow
ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)
WHERE point.game_id = 1
GROUP BY CONCAT(shot.team_id, freethrow.team_id)
""",conn)
###### d)
pd.read_sql("""
SELECT SUM(point.points) points, CONCAT(shot.team_id, freethrow.team_id) team_id, point.game_id
FROM point
LEFT JOIN shot
ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
LEFT JOIN freethrow
ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)
GROUP BY CONCAT(shot.team_id, freethrow.team_id), point.game_id
ORDER BY point.game_id
""",conn)
'''e = pd.read_sql("""
SELECT SUM(point.points) points, CONCAT(shot.team_id, freethrow.team_id) team_id, point.game_id, home_team_id, away_team_id
FROM point
LEFT JOIN shot
ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
LEFT JOIN freethrow
ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)
LEFT JOIN game
ON (game.id = point.game_id
AND ( CONCAT(shot.team_id, freethrow.team_id) = home_team_id
OR CONCAT(shot.team_id, freethrow.team_id) = away_team_id ))
GROUP BY CONCAT(shot.team_id, freethrow.team_id), point.game_id, home_team_id, away_team_id
ORDER BY point.game_id
""",conn)
'''
###### e) Away and home team points in one line
e = pd.read_sql("""
SELECT A.game_id , A.points away_team_points, A.team_id away_team_id,
B.points home_team_points, B.team_id home_team_id
FROM
(SELECT SUM(point.points) points, CONCAT(shot.team_id, freethrow.team_id) team_id,
point.game_id, home_team_id, away_team_id
FROM point
LEFT JOIN shot
ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
LEFT JOIN freethrow
ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)
LEFT JOIN game
ON (game.id = point.game_id
AND ( CONCAT(shot.team_id, freethrow.team_id) = home_team_id
OR CONCAT(shot.team_id, freethrow.team_id) = away_team_id ))
GROUP BY CONCAT(shot.team_id, freethrow.team_id), point.game_id, home_team_id, away_team_id) A,
(SELECT SUM(point.points) points, CONCAT(shot.team_id, freethrow.team_id) team_id,
point.game_id, home_team_id, away_team_id
FROM point
LEFT JOIN shot
ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
LEFT JOIN freethrow
ON (freethrow.cntr = point.cntr AND freethrow.game_id = point.game_id)
LEFT JOIN game
ON (game.id = point.game_id
AND ( CONCAT(shot.team_id, freethrow.team_id) = home_team_id
OR CONCAT(shot.team_id, freethrow.team_id) = away_team_id ))
GROUP BY CONCAT(shot.team_id, freethrow.team_id), point.game_id, home_team_id, away_team_id) B
WHERE (A.game_id = B.game_id) AND (A.team_id = A.away_team_id) AND (B.team_id = B.home_team_id)
ORDER BY A.game_id
""",conn)
e # I would be really glad if someone would tell me how could I make this query without copy-pasting the previous query
# for the self join
e[e["home_team_id"] == "BOS"] #I looked up the results for Boston on Wiki if they match, and they do.
Choose one from the following and visualize it!
a) Show the "hottest" positions on the floor where the players scored most! (Create a 2d histogram of the field)
b) What properties of a player are the most correlated with their scoring/assisting/rebound performance? Select some properties and create an interactive visualization for it, that shows the relation between these terms!
c) Create pie charts on the US map that shows for each teams' geolocation their average received/scored point ratio!
Don't forget to comment on the data sources (tables) you used and the method how you obtained the values!
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm
import seaborn as sns
shots = pd.read_sql("""
SELECT * FROM shot
INNER JOIN point
ON (shot.cntr = point.cntr AND shot.game_id = point.game_id)
""",conn)
shots
print(shots.isna().sum()) #We have missing values in x and y.
shots_xy = shots[["x","y"]]
shots_xy = shots_xy.dropna() #I'll just get rid of them until possible later analysis
plt.figure(figsize=[10,10])
a = plt.hist2d(shots_xy.x, shots_xy.y, bins = (50,50), norm = LogNorm())
plt.colorbar()
plt.grid()
#2D histogram of shots
import plotly.graph_objects as go
from PIL import ImageColor
colors = ["#440154", "#3D4D8A", "#25838E", "#37B878", "#BADE28"]
ticks = [0, 10**(-3), 10**(-2), 10**(-1), 1]
colorscale = [ [ticks[i], "rgb"+str(ImageColor.getcolor(colors[i], "RGB"))] for i in range(len(ticks))]
# I tried to mimic the matplotlib colorscale
# colorscale is like [[0, "rgb(255,255,255)"],
# ...,
# [1, "rgb(0,0,0)"] ]
fig = go.Figure(go.Heatmap(
x = a[2],
y = a[1],
z = a[0],
colorscale = colorscale))
fig.show()
a) Print season results by division. For each team, calculate with a single_ query
Compare with https://en.wikipedia.org/wiki/2008–09_NBA_season#By_division